ABSTRACT

Designed for persons with no prior knowledge of
statistics, this continuing education course syllabus presents basic
information on methods of data collection and analysis in libraries.
It is noted that emphasis is placed on concepts rather than
mathematical formulas and on reasons for using particular techniques.
Topics covered include problem definition; study design; the
advantages and disadvantages of using direct observation, historical
records, published surveys, interviews, and questionnaires for data
collection; the design, administration, and evaluation of
questionnaires; random sampling; the tabulation and graphical
representation of descriptive statistics; statistical estimation;
statistical decisions; variance tests (the F test and analysis of
variance); the Chi-Square test; correlation; and regression. Examples
are provided of the use of data collection and analysis techniques in
libraries. Also provided are a series of 12 problems to 
a list of important equations , an 11-item bibliography, 
list of journals that report the results of research in 
science. (ESR) 
41: Introductory Data Collection and Analysis



This course is intended as an introduction to methods of data
collection and analysis in libraries. No prior knowledge of statistics
is assumed.

The topics included should provide background both for persons
interested in conducting simple quantitative studies themselves and for
those who are interested in better understanding quantitative articles
in the literature.

Emphasis will be given to the questionnaire and direct sampling
methods of data collection. Data analysis methods discussed will
include descriptive statistics, regression, correlation, and tests of
significance. The use of these techniques in a library setting will
be discussed with examples familiar to the participants.

Concepts will be stressed rather than mathematical formulas,
although examples will be worked in class. Problem formulation,
reasons for using particular techniques, advantages and disadvantages
of the various methods, and the interpretation and presentation of
results will be emphasized.



Course objectives 



At the conclusion of the course, participants will be able to:

1. design and administer a simple questionnaire or direct
sampling investigation using appropriate methodology;

2. critically evaluate questionnaire and direct sampling
studies as reported in the library literature;

3. evaluate and improve existing data collection and
analysis methods within their own institution.
INTRODUCTION



The course is intended to serve as an introduction to the
collection and analysis of data as an aid to decision-making.
Terminology and applications are the main focus, the intention
being to provide you with a starting point for you to read and
evaluate the literature of quantitative techniques as used in
libraries.

If you want to learn more, choose really introductory books
(some examples are given in the bibliography) and read from
several different sources. It is advantageous to see the same
concepts expressed in different ways.

A word of advice: if you are interested in conducting substantial
studies, get someone with some experience to help you. There
should be someone around your work place who routinely uses
statistical techniques. This person will probably be willing to
advise you in designing your study, provided that you actually do
the work.

As for actually solving mathematical problems, there are
many different computer programs, such as Omnitab and SPSS,
which can perform the actual calculations necessary in solving statistical
problems. So don't get mired in the mathematics of the problems you
solve; however, you must know what techniques to use and how to interpret
the results of the computer programs.

It is important that you read this Syllabus before you come to 
the session. Exercises in the Syllabus as well as additional example 
problems will be gone over in class. Class attendees will have the 
opportunity to discuss problems particular to their institutions as 
time permits. It will be helpful to bring a pocket calculator. 
OUTLINE

I. Data Collection and Analysis

   A. Problem Definition/Purpose of Study

   B. Study Design

   C. Data Collection

      1. Data Sources

         a. Direct Observation

         b. Historical Records

         c. Published Surveys

         d. Interviews

         e. Questionnaires

            Non-Response Problem in Questionnaires
            Other Guidelines for Questionnaires

      2. Introduction to Random Sampling

   D. Data Analysis

      1. Descriptive Statistics

      2. Statistical Estimation and Statistical Decision

      3. Variance Tests

         a. F Test

         b. Analysis of Variance

      4. Chi-Square Test

      5. Correlation

      6. Regression

   E. Presentation of Results

II. Problems

III. Equations

IV. Recommended Readings
I. Data Collection and Analysis

A. Problem Definition/Purpose of Study

Collection and analysis of data (statistics) 
is merely a tool; as such, it can be used for good 
or evil. The real power of this tool depends upon 
its ability to shed light on a particular problem. 
Defining the problem or determining the goal of the 
study is the most crucial step in the entire process. 
It is all too easy to become overly concerned with
numbers and forget the real problem.

There are certainly many ways to solve
problems: for example, accepting the advice of a
trusted person, reading the literature, or using 
quantitative methods. Numerical methods are not 
necessarily better than others. The fact that a 
technique is quantitative does not mean that it is 
"scientific". A good quantitative study is appropri- 
ately designed and addresses the real problem to be 
solved . 

This course introduces some techniques
which may help you glean from data information
that will help solve a problem. One must
keep in mind that there is a lot of judgment involved in
applying quantitative methods, because the right problem
has to be addressed. A very real danger in defining a
problem is "suboptimization". This means that you
don't look far enough for the problem and therefore
concentrate on only a portion of the problem or even on 
the wrong problem. If the problem that you have de- 
fined is not the real problem but only a symptom of it, 
all the sophisticated mathematical analyses in the world 
will not help you solve the problem that you wanted to 
attack. For example, you might define the problem as 
being how to encourage people to read more books, 
whereas the broader problem, and the one you should be
studying, is how more people can be educated. The
latter problem would lead to investigation of methods 
of education, non-print media, books, computer re- 
trieval systems and so forth instead of merely pur- 
chasing more books or initiating reading programs. 

In summary, the first step in any quantitative 
study is to define the problem to be investigated, thus
clarifying the objectives of the study. 

Class participants should bring some examples 
of problems to be discussed. 
B. Study Design

After the purpose of the study has been defined,
the study is designed. The following questions should
be answered before beginning a study:

Questions Related to Background and Problem Definition

1. Is the study necessary?

2. How much is known about the subject from the
literature or other studies?

3. What information is to be determined from the
study; i.e., what type of information is required
to achieve the purpose of the study?

Questions Related to Data Collection and Analysis 



4. Where and how can that information be obtained? 

5. What is the target population? 

6. Should there be a sample or a complete census?

7. If you sample, how important is it that the 
information be accurate, and how much time and 
money are available? 

8. What type of data source is most appropriate? 

9. What type of analysis techniques would be useful? 



Once the problem is defined and steps 1 and 2
completed, the appropriate data to solve the problem
must be isolated. Consider a fairly tightly constrained
problem such as selection of a jobber for
journal purchase. What information is needed to
solve the problem? First, determine what you
think is most important to make a jobber effective.
For example, you might choose:

1. Reliability
2. Cost
3. Time from order to receipt

If it is not possible to measure some aspect
of effectiveness, a surrogate measure can be chosen.
For example, reliability is difficult or impossible
to measure, so the number of claims made might
be substituted for lack of a better measure. Thus,
in evaluating one jobber's performance against
another's, the time, cost, and number of claims
might be used for comparison.

Now steps 4 through 7 must be completed.
Try to choose the most cost-beneficial method of
getting the information. Additional accuracy in
data often is obtained only at additional cost which
may or may not be justified by the use to which you
put the data. For example, to address the problem
"how is the library meeting the needs of the users?"
you might consider the following data collection
methods:

1. Ask every user exiting the library a group of
well-selected questions

2. Take a sample of users and ask them certain
questions

3. Put up a suggestion box

4. Wait for comments by users.

The first approach requires quite a bit of time
for carefully designing the questions, asking them,
and analyzing the results as well as time from the
users. The second method takes time for designing a
sampling technique but less time than questioning all
users. With a carefully designed sample and good
questions, quite accurate results can be obtained at a
lower cost than by the first method. The third method
is cheap but will definitely result in biased results.
For example, the suggestions will be based on the
expectations of the users. Users will probably not
make a suggestion unless they feel the library can make
a change. They might suggest longer operating hours
but not better reference service because they have
grown to accept mediocre service. The last approach
will probably result in even more of a biased result
than the third.

In summary, good study design means choosing 
the method which will get the most useful information 
in the most accurate and cheapest way for a particular 
problem, defining the population from which data is to 
be collected, and selecting a data analysis technique. 
The study design stage involves planning the entire 
study, in other words, through the data analysis stage.

Consider the problem which has been 
defined as: 

How much card catalog space will be required 
in five years? 

Go through steps 1 through 7 with this problem. 
Could the problem be defined in another way? 

The class will have time to discuss some problems 
of special interest to them . 
C. Data Collection

Data may be obtained either by a complete
count (census) or a sampling of the group being investigated.
In practical cases a census is often unwieldy
because of the size of the group to be investigated; a
correctly taken sample can yield accurate enough results
at a much lower cost and in a shorter time than
a census. In fact, sometimes a sample can be more
accurate than a census because the larger amount of
data collected by a census introduces a greater chance
of inaccuracies in the data.

A list of data collection methods and some 
advantages and disadvantages of each follow. No 
method is particularly better than another, but the 
most appropriate source of data should be chosen 
for each problem. 
1. Data Sources

Using either a census or a sample, there 
are several sources of data: 

a. Direct Observation 
Advantages 

— All assumptions are known 

— Data are usually consistent 

— Data are usually not subject to 
misinterpretation 

Disadvantages 

— Time-consuming

— May be difficult to obtain a random 
sample 

b. Historical Records 
Advantages 

— Simple to collect data if they already
exist

— Data are usually consistent if records
are kept consistently

Disadvantage

— The data desired may not be available
in their complete form

c. Published Surveys 
Advantages 

— The data are already compiled, 
saving time and expense 

— The responsibility for accuracy may 
be shifted 

Disadvantages 

— The data obtained by the primary
investigation cannot be verified

— The statistical technique used may
not be ascertainable and therefore the
accuracy of the results may not be
verifiable

— Subjective compiling and interpretation
may have influenced the result
shown

— The purpose of the study may have
prejudiced the choice of material and
technique adopted

— A representative sample may not
have been taken

d. Interviews
Advantages 

— A higher degree of accuracy is 
attained through the acquisition of 
data direct from the source 

— Data are often obtained that cannot 
be obtained through a question- 
naire alone 

— There is opportunity personally to
check information acquired

— The "no response" proportion is 
usually minimized 

Disadvantages 

— Only relatively small samples can
be gathered because of the cost

— The subjective factor is involved in
recording by interviewer

e. Questionnaires
Advantages

— A large area may be easily and
quickly covered

— The method of collecting data is
relatively inexpensive

Disadvantages 

— Frequently questions cannot be 
answered without supplementary 
explanations 

— In many cases the results are unre- 
liable due to a large "non response" 

— Questions may be misinterpreted or 
answered incorrectly 

For the following problems, which would be the most
appropriate data source and why?

1. What amount of space will be required for housing materials
in five years?

2. How satisfied are users with the library's service?

3. What type of materials are used by practicing physicians?

4. What should the operating budget of a nursing library be?

Because questionnaires have been so frequently used in library
studies, guidelines for conducting a questionnaire and a discussion of
the problem of non-response follow:
Non-Response Problem in Questionnaires

The effectiveness of the questionnaire method depends not
only on the size of the original sample but also on the percentage
of returns. The return rate is very important in questionnaires
because of potential bias in the results due to the non-respondents
being different from the respondents in some important
way(s). An attempt should be made to discover who does not return
the questionnaire and every effort made to encourage response
in order to minimize problems with non-respondent bias.

Some ways of encouraging response are:

— incentives
— follow-ups
— include stamped self-addressed envelope
— guarantee anonymity
— make questionnaire attractive and easy to complete
— communicate the importance of the questionnaire
— establish reasonable deadlines
— offer the results of the survey to interested
respondents
Other Guidelines for Questionnaires

Be sure that all the questions are necessary; there is a temptation
to ask too much.

Be sure that each question is understandable and precise; ambiguity
discourages response and accuracy.

Define terms carefully to avoid misinterpretation . 

Pretest the questionnaire with a similar group; a pretest can 
indicate where there are problems with ambiguous questions. 

Do not ask people to answer questions they cannot. Questions 
should be easy to answer; if they require looking up from various 
sources, respondents may make up answers. 

Send the questionnaire to the person most likely to know the answer. 

Sequence the questions in a logical order; the order of questions is
important for a train of thought.

Precoded questions (those that provide for only certain, set
answers) are easier to complete than open-ended ones.

Pay attention to length and physical layout; long, unattractive 
questionnaires have lower response rates than short, attractive 
ones . 

The questionnaire should be easy to analyze. 
Questions should be objective, not "loaded".
How well does the following questionnaire follow the guidelines?

To: Reference Librarian, University Library

QUESTIONS

1. Books

   Classification          No. of Titles    No. of Volumes

   General Works
   Philosophy
   Social Science
   Pure Science
   Art, (Fine)
   History
   Travel
   Biography (medical)
   Fiction
   Medicine

How are paperback acquired?
Used?

Name five periodicals subscribed to:

List other periodicals available.

Are some periodicals maintained in back files?

List five.

How are they kept?

Are visual aids available?

   Check Below             Number Available

   Motion Pictures
   Film Strips
   Slides
   Recordings
   Tapes
   Discs
   Transparencies
SECTION D.1 SHOULD BE READ BEFORE THIS SECTION

2. Introduction to Random Sampling

Certain statistical laws enable us to infer
things about a population from a sample taken
from that population. If the sample is chosen
according to certain rules, the accuracy of the
information collected from the sample can be
determined. This theory makes it possible to
do inferential statistics: tests of significance,
hypothesis testing, and estimation of population
parameters from a sample. The basic concept of
sampling theory is closely related to the normal
distribution.

A normal distribution is symmetrical
(the curve looks identical on either side of
the mean), bell-shaped, and asymptotic to
the horizontal axis; i.e., it never touches the
horizontal axis.

[Figure: bell-shaped normal curve with the mean marked at its center]



A lot of phenomena in the world follow
this normal distribution. Sampling theory is
based on the fact that the means of all possible
samples of a population follow, at least approximately,
a normal distribution and the mean of all possible sample
means equals the mean of the population.
This permits us to infer information about the
entire population from a small sample because
of certain known properties of the normal
distribution.

In order to properly utilize inferential
statistics, the sample from which you want to
infer something about the population must be
representative of the population. One way to
accomplish this objective is by simple random
sampling (each member of the population has
an equal chance of being selected). Two
methods that are commonly used to select
random samples are (1) systematic sampling
(taking every "ith" item on a list) and (2) using
a random number table, frequently found in the
back of statistics books, to determine which
items in a list to include in the sample. Both
methods will be explained in class.
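
The two selection methods just described can be tried out on a computer. The following is only a minimal sketch in Python: the function names and the list of 200 numbered cards are hypothetical, and the computer's random number generator stands in for the printed random number table. Choosing the starting point of the systematic sample at random within the first interval is an assumption of this sketch.

```python
import random


def systematic_sample(items, step, start=None, rng=random):
    """Take every `step`-th item, beginning at position `start`.

    If `start` is not given, a random position within the first
    interval is used (an assumption of this sketch).
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    if start is None:
        start = rng.randrange(step)
    return items[start::step]


def simple_random_sample(items, size, rng=random):
    """Draw `size` distinct items; every item is equally likely to be chosen."""
    return rng.sample(items, size)


if __name__ == "__main__":
    # Hypothetical list: 200 numbered catalog cards.
    cards = list(range(1, 201))
    rng = random.Random(1)  # fixed seed so the run can be repeated
    print(systematic_sample(cards, step=20, rng=rng))
    print(simple_random_sample(cards, size=10, rng=rng))
```

Both functions return a sample of ten cards here; the systematic sample is evenly spaced through the list, while the simple random sample is not.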
There are other correct methods of taking
random samples which may be more cost-effective
than simple random sampling, but
they are outside the scope of this course.

Determining an appropriate sample size

The sample size is related to both confidence
level (risk) and confidence interval (precision).
The objective of determining the sample size is
to achieve a satisfactory degree of certainty
(95-99% for example) that the sample estimate
of a particular characteristic is in error by no
more than a specified amount.

The equation for determining the sample
size is:

n = \frac{z^2 v}{e^2}

where n = sample size to be determined

v = variance of population, the number repre- 
senting the population dispersion (how much the 
members of the population vary). To ascertain the 
size of the sample to achieve a specified precision, 
we must have a reasonable approximation of v in 
advance. A knowledge of the approximate
size of v may be available from past experience.
If not, some pretesting and preliminary
study may be necessary in order to have an
approximate value for it.

If the population has little variance, the size 
of the sample can be small. For example, if 
every member of the population were exactly 
alike, you could just sample one member and you 
would then know about all the others. 

e = number representing confidence 
interval (precision) required. The size of the 
sample required is tremendously affected by 
the size of e (error tolerated or precision 
required). 

z = number representing confidence level
(risk) required. The value selected for z
determines the probability that the sample
result will have an error no greater than e.

z and e are chosen by the investigator 
based on what the information will be used for 
and the amount of time and money available . 

Problem 3 on page 55 is related to this 
section; other examples of sample size determination 
will be worked in class . 
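
As a rough illustration of how v, e, and z interact, the sketch below computes a sample size in Python from the relationship n = z^2 v / e^2. The function name and the figures (a variance of 16 guessed from a pretest, z of about 1.96 for 95% confidence, tolerated errors of 1 day and half a day) are hypothetical and chosen only for illustration.

```python
import math


def sample_size(z, variance, tolerated_error):
    """Sample size n = z^2 * v / e^2, rounded up to a whole member."""
    if tolerated_error <= 0:
        raise ValueError("tolerated_error must be positive")
    return math.ceil((z ** 2) * variance / (tolerated_error ** 2))


if __name__ == "__main__":
    # Hypothetical study of loan periods measured in days.
    print(sample_size(1.96, 16, 1.0))   # tolerate an error of 1 day
    print(sample_size(1.96, 16, 0.5))   # tolerate an error of half a day
```

Halving the tolerated error roughly quadruples the required sample, which is why the size of e affects the sample size so strongly.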
D. Data Analysis

Quantitative studies may be descriptive, such as
describing the average salary in a sample, or analytical,
for example, predicting next year's salary increase for
a group based on samples of data. In an analytical
study, the planning of the analysis you conduct is
essential. Do not gather a multitude of data and then
try to decide what to do with it.

The type of analysis you perform with any set of
data depends upon the quality of the data and should not
be more sophisticated than the data nor inconsistent
with the aim of the study: statistical manipulation and
decimal points can give an aura of conviction which is
unwarranted.

1. Descriptive Statistics

Descriptive statistics includes the tabulation,
classification, graphical representation,
and calculation of certain summary
values which describe group characteristics.

Distributions are graphs which describe
the occurrence of values. Histograms and line
graphs are used to describe values that can be
dissected into discrete parts; curves are used to
describe continuous values (those which can take
on any value).

[Figure: three example distributions: histogram, line graph, curve]

Measures of Central Tendency

Measures of central tendency provide a
single number which, in a sense, summarizes
a group of data. The most common ones are:

- Mean: The total of all the values divided
by the number of values; also called
the average. Equation:

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

- Median: If all values are arranged in ascending
or descending order, the value which is the
middle one is the median. This number
is not as sensitive to extremely
large or small numbers as is
the mean. Therefore the median may
sometimes be more meaningful than
the mean.

- Mode: The value which occurs most
frequently. A distribution may have
more than one mode.

Measures of Dispersion 

Measures of dispersion are single 
values which describe the scatter or differ- 
ences among values. Those which are most 
useful are: 

- Range: Simplest of all measures of
dispersion, the range can be the difference
between the largest and smallest
values or the largest and smallest
numbers themselves.

- Variance: The arithmetic mean of
the squared deviations from the mean.
Equation:

v = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2

where \bar{x} = the mean
x_1 to x_n = each value
n = number of values
[Figure: n sample values plotted around the mean, with the vertical distance of each value from the mean marked; these differences are squared and summed]

If the vertical distances shown above
(the deviations from the mean) were just
added and not squared before adding, the
result would always be 0.

- Standard Deviation: The square root
of the variance. This measure is in
the same units as the original values.

s = \sqrt{v}

Two sets of numbers may have 
the same average but be quite different. 
The set with the larger variance and 
standard deviation would be more 
scattered . 

Problems 1 and 2 on page 55 
are related to this section. 
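
The summary measures of this section can be computed with a few lines of Python, shown here as a minimal sketch using the standard `statistics` module. The function name and the circulation counts for the two branches are hypothetical. The variance divides by n, following the definition given in this section.

```python
import statistics


def describe(values):
    """Return the summary measures defined in this section."""
    variance = statistics.pvariance(values)  # mean of squared deviations (divides by n)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),  # there may be more than one mode
        "range": max(values) - min(values),
        "variance": variance,
        "standard_deviation": variance ** 0.5,
    }


if __name__ == "__main__":
    # Hypothetical daily circulation counts for two branches with the same mean.
    branch_a = [48, 49, 50, 50, 51, 52]
    branch_b = [20, 35, 50, 50, 65, 80]
    for name, counts in (("A", branch_a), ("B", branch_b)):
        print(name, describe(counts))
```

Both branches have a mean of 50, but branch B shows the larger variance and standard deviation, so its values are more scattered.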
2. Statistical Estimation and Statistical Decisions

The techniques of inferential statistics provide
you with the ability to make sound conclusions
about a population on the basis of a sample.
Statistical inference may be used to make an estimate
of a population value based on a limited
sample (statistical estimates) or it may be used
to test some hypothesis by means of information
from an experiment or sample (statistical
decisions).

Statistical Estimations

A population parameter can be estimated
on the basis of a sample. For
example, if the average price of a journal
is unknown, an estimate can be obtained from a
sample. The estimate of the average price of
all journals is obtained from the sample mean.
The type of estimate that consists of a single
value is called a point estimate. An interval
estimate states in terms of probability the
likelihood that the population mean will occur
in a specified range.

An interval estimate based on a sample mean is
represented by:

\bar{x} \pm z s_{\bar{x}}

where \bar{x} = mean calculated from sample
z = represents the level of confidence
required and
s_{\bar{x}} = standard deviation of the sample
means, estimated from the sample standard deviation s

Problem 4 on page 56 is related to this section. 
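
A minimal Python sketch of an interval estimate follows. It is an illustration under stated assumptions, not the syllabus's own procedure: the function name and the journal prices are hypothetical, z = 1.96 is taken for 95% confidence, and the standard deviation of the sample means is estimated as s divided by the square root of n - 1, with s computed using divisor n as in the variance definition of section D.1.

```python
import math
import statistics


def interval_estimate(sample, z=1.96):
    """Return (low, high) for the interval x_bar +/- z * s_xbar.

    Assumption of this sketch: s_xbar = s / sqrt(n - 1), where s is the
    standard deviation computed with divisor n.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("at least two values are needed")
    mean = statistics.mean(sample)
    s = statistics.pstdev(sample)
    standard_error = s / math.sqrt(n - 1)
    return mean - z * standard_error, mean + z * standard_error


if __name__ == "__main__":
    # Hypothetical prices (in dollars) of a sample of ten journals.
    prices = [18, 22, 25, 30, 31, 35, 40, 42, 55, 60]
    low, high = interval_estimate(prices)
    print(f"point estimate: {statistics.mean(prices):.2f}")
    print(f"interval estimate: {low:.2f} to {high:.2f}")
```

With a sample this small, many textbooks would use a value from the t table in place of z; z is kept here only to match the formula above.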

Statistical Decisions

Procedures which enable us to decide
whether to accept or reject hypotheses or to
determine whether observed samples differ
significantly from expected results are called
tests of hypotheses, tests of significance, or
rules of decision. A statistical hypothesis is
a tentative statement about one or more parameters
(values which describe the entire population)
of a population or group of populations.

In the literature you will see statements
such as:

- The difference between the sample
means is significant at P = .01.
- Sample means differed at the 0.05
level of significance.
- Null hypothesis was rejected at the
0.05 level.

This section will teach you what such
statements mean. Hypotheses are usually
made about populations and the data gathered
from a sample or samples of the population are
used to test the hypotheses. Hypotheses
are stated in terms of differences between
populations, such as the difference between
the population of male librarians and the
population of female librarians. The
numbers obtained from the data from the
populations are then subjected to tests of
significance described later.

If the numbers from the two samples are
different it may be only because of
sampling error. You must, therefore,
determine if there is enough difference 
between them to say that the two samples 
come from different populations. 

The null hypothesis states that there 
is no difference between the two groups. 
Rejecting the null means there is a real 
difference between the groups. However, 
an hypothesis can be rejected when it 
should have been accepted, and conceivably 
it can be accepted when it should have been 
rejected . 

These two types of errors associated
with rejection of the null are called "Type I"
and "Type II". Type I error is rejecting a
true hypothesis and Type II is accepting a
false hypothesis. They are tabulated below.

                      Accept              Reject

Hypothesis true       Correct decision    Type I error
Hypothesis false      Type II error       Correct decision

We usually specify the level of significance
of a Type I error to be 0.05 or 0.01. Five
percent (.05) means that there are five
chances out of one hundred that we might
make a mistake and say there is a real
difference between the two groups when there
is not (reject the null hypothesis when it should
not be).

Not being able to reject the null 
hypothesis does not mean that it can 
necessarily be accepted. That could 
result in a Type II error. Do not accept
the null unless the probability of a Type II 
error is known. Just do not reject the null. 

A t-test can be used to test an
hypothesis about the mean of a population.
For example, the binding company says they
can return items in an average of 12 days.
We take a sample of ten and the mean is
16. Could the population mean really be 12?

H_0: \mu = 12
n = 10
\bar{x} = 16        degrees of freedom = n - 1 = 9
s = 4

t = \frac{(\bar{x} - \mu) \sqrt{n - 1}}{s}

t = \frac{(16 - 12) \sqrt{9}}{4}

t = 3

t at .05 significance: -2.26 to 2.26
t at .01 significance: -3.25 to 3.25

3 > 2.26, so reject the null at .05; but 3 < 3.25, so the null
cannot be rejected at .01.
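
The binding company example can be checked with a short Python sketch. The function and variable names are hypothetical; the critical values 2.26 and 3.25 are the ones quoted above for 9 degrees of freedom, and the formula assumes, as the worked example does, that s is multiplied out against the square root of n - 1.

```python
import math


def one_sample_t(sample_mean, hypothesized_mean, s, n):
    """t = (x_bar - mu) * sqrt(n - 1) / s, as in the worked example."""
    return (sample_mean - hypothesized_mean) * math.sqrt(n - 1) / s


# Two-tailed critical values of t for 9 degrees of freedom, as quoted in the text.
CRITICAL_T_DF9 = {0.05: 2.26, 0.01: 3.25}


def decisions(t, critical_values):
    """For each significance level, report whether the null can be rejected."""
    return {level: abs(t) > crit for level, crit in critical_values.items()}


if __name__ == "__main__":
    t = one_sample_t(sample_mean=16, hypothesized_mean=12, s=4, n=10)
    print("t =", t)
    for level, reject in decisions(t, CRITICAL_T_DF9).items():
        print(f"level {level}: {'reject' if reject else 'do not reject'} the null")
```

For other sample sizes the critical values would have to be looked up again in a t table under the new degrees of freedom.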

The means of two samples can
be compared using a t-test.

Equation:

t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2} \sqrt{1/n_1 + 1/n_2}}

where s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}}

degrees of freedom = n_1 + n_2 - 2

Problem 10 on page 59 is related to this
section.
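
A comparison of two sample means might be sketched in Python as follows. This is an illustration only: the function name and the turnaround times for the two jobbers are hypothetical, and the sketch assumes the usual pooled form of the test, in which each sample variance is computed with divisor n and the two are combined over n1 + n2 - 2 degrees of freedom.

```python
import math
import statistics


def two_sample_t(sample_1, sample_2):
    """Return (t, degrees_of_freedom) for a pooled two-sample t-test."""
    n1, n2 = len(sample_1), len(sample_2)
    df = n1 + n2 - 2
    if df < 1:
        raise ValueError("the two samples together need at least three values")
    # Variances with divisor n, matching the definition in section D.1.
    v1 = statistics.pvariance(sample_1)
    v2 = statistics.pvariance(sample_2)
    pooled_sd = math.sqrt((n1 * v1 + n2 * v2) / df)
    difference = statistics.mean(sample_1) - statistics.mean(sample_2)
    return difference / (pooled_sd * math.sqrt(1 / n1 + 1 / n2)), df


if __name__ == "__main__":
    # Hypothetical days from order to receipt for two jobbers.
    jobber_a = [21, 25, 19, 30, 24, 27]
    jobber_b = [33, 29, 35, 28, 31, 36]
    t, df = two_sample_t(jobber_a, jobber_b)
    print(f"t = {t:.2f} with {df} degrees of freedom")
```

The resulting t is compared with the tabled value for the stated degrees of freedom, exactly as in the one-sample case.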
3. Variance Tests

As stated previously, standard deviations
and variances reflect the amount of scatter in
collected data.

a. F Test

The F test is used to determine whether
one sample has more scatter than
another or if the difference is really due
to sampling error. Large samples do
not require as large an F ratio for significance
as small samples because there
is less sampling error in larger samples.

F ratio = \frac{\text{Larger variance}}{\text{Smaller variance}}

Example

Medical students' circulation of
materials: n = 41
s^2 = v = 33.5
Residents' circulation of materials:
n = 50
s^2 = v = 74.1

F ratio = \frac{74.1}{33.5} = 2.21
The F ratio is looked up in an
F table in most statistics books using
also the number in each of the samples.
If the F ratio is large enough the null
hypothesis can be rejected and there is
a significant difference in the scatter.
In this case the critical F value for sample sizes
41 and 50 is around 1.6, which is less than
the computed 2.21, so the null can be rejected. Problem
6 is related to this section.
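
The circulation example can be reproduced with a very small Python sketch. The function name is hypothetical, and the critical value of about 1.6 is simply the tabled figure quoted in the text for these sample sizes, not something the code looks up.

```python
def f_ratio(variance_a, variance_b):
    """Return the larger variance divided by the smaller variance."""
    larger, smaller = max(variance_a, variance_b), min(variance_a, variance_b)
    if smaller <= 0:
        raise ValueError("variances must be positive")
    return larger / smaller


if __name__ == "__main__":
    students_variance = 33.5    # medical students, n = 41
    residents_variance = 74.1   # residents, n = 50
    critical_f = 1.6            # approximate tabled value quoted in the text

    f = f_ratio(students_variance, residents_variance)
    print(f"F ratio = {f:.2f}")
    if f > critical_f:
        print("Reject the null: the scatter differs significantly.")
    else:
        print("Do not reject the null.")
```

Putting the larger variance in the numerator means the ratio is never less than 1, which is how most printed F tables expect it.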

b. Analysis of Variance (ANOVA)

This method actually compares means of
samples rather than variance, but the variance
measures are used to compare the means. To
compare the means of two samples to see if
they differ significantly a t test could be used.
To compare several samples ANOVA would be
used. The purpose is to show that the variance
estimate between the samples is significantly
larger than the variance estimate within the
samples.

ANOVA indicates if there is an overall
difference among the means; it does not
indicate which groups produced the difference.
ANOVA can be used to decide whether samples
were drawn from the same population.

Example of the use of A NOVA 

Three methods of supervisory counseling 
on absenteeism are being analyzed. The null 
hypothesis says there is no difference in the 
effectiveness of these methods. You would 
expect some variability among absence rates 
of the three samples just due to sampling 
error. Is the variability among absence 
rates of the samples large enough to reject 
the null and say there is a difference in the 
effectiveness of the methods? 

Procedure:

1. Calculate the within-groups variance for
each of the three samples, then combine
the three estimates to obtain
one estimate.






2. Calculate the between-groups variance
estimate from the means of the three
samples and the size of the samples.

An F ratio is computed between the
two variance estimates using the between-groups
variance as the numerator and the
within-groups variance as the denominator.
If the means of the groups really explain
the variation (the samples are not from the same population),
the numerator has much more variance than
the denominator.

$$F = \frac{\text{between-groups variance}}{\text{within-groups variance}}$$

A computer program can be used to
calculate the F ratio for ANOVA. The
results printed will usually be the ratio
and the level of significance at which the groups
differed. You must then decide if the level is
sufficiently high to reject the null hypothesis.
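
The two-step procedure above can be sketched as follows. This is a minimal illustration, not a replacement for a statistics package: the function name `one_way_anova_f` is an assumption, and the absence figures for the three counseling methods are hypothetical numbers invented for this example, since the text gives no data.

```python
from statistics import mean, variance


def one_way_anova_f(groups):
    """Return (between-groups variance, within-groups variance, F ratio)."""
    k = len(groups)                          # number of samples
    n_total = sum(len(g) for g in groups)    # total observations
    grand_mean = mean([x for g in groups for x in g])

    # Step 1: variance within each sample, combined into one estimate.
    within_ss = sum((len(g) - 1) * variance(g) for g in groups)
    within_var = within_ss / (n_total - k)

    # Step 2: variance estimate from the sample means and sample sizes.
    between_ss = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    between_var = between_ss / (k - 1)

    return between_var, within_var, between_var / within_var


# Hypothetical days absent under three counseling methods.
method_a = [4, 5, 6, 5]
method_b = [7, 8, 9, 8]
method_c = [5, 6, 7, 6]

between, within, f = one_way_anova_f([method_a, method_b, method_c])
print(f"between = {between:.3f}, within = {within:.3f}, F = {f:.2f}")
```

The resulting F ratio would then be compared with an F table using 2 and 9 degrees of freedom (number of groups minus one, and total observations minus number of groups).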

Another example of the use of ANOVA
might arise when designing an automated
circulation system. To determine the type of
record format, one might wish to see if there
is a difference in the average length of titles
among journals, monographs and audiovisuals.

Often it is desirable to test hypotheses
concerning two variables. In the previous
example of the record length, the question was
whether or not there were significant differences
between types of materials. To carry this a
step further, assume there were two languages.
The investigator might also be interested in
testing to see if there was a significant
difference between the two languages in the
average length of the titles. Testing both variables
together is called two-way analysis of variance.

4. Chi Square

The chi square test determines whether there is
a statistically significant difference between the
frequency of events among groups. (It does not
reveal how much difference there is, and if the test
fails it does not really prove that the groups are the
same.) This test is used when comparing frequencies
of events among several groups and can be used with
qualitative data, e.g., hair color.

Characteristics of the Chi Square Method 

- Non-parametric method, i.e., it makes no
assumptions about the distribution of the item
sampled. The assumptions of non-parametric
methods are usually easier to satisfy than
those of parametric methods.

- Deals with frequencies rather than scores.

- Can be used with several groups and attributes.

The Chi Square Test can be used:

- To test the significance of a statistic: do
the groups differ significantly? As in all
sampling situations, there is a possibility
that the difference in the numbers may be
due to sampling error. A significant difference
means that the observed frequencies
differ enough from the expected frequencies
(those we would expect to occur by chance)
to justify rejection of the null hypothesis of
no difference in the groups.

- To test the goodness of fit. How well do the
data gathered fit a theoretical distribution, for
example, the normal? If the data fit a standard
distribution about which certain properties
are known, those properties can be used
for prediction.

Definition of Chi Square

Chi Square is a statistic ($\chi^2$) which supplies a
measure of the discrepancy existing between
observed and expected frequencies.

If $\chi^2 = 0$, the observed and theoretical
frequencies agree exactly. The larger the value
of $\chi^2$, the greater the discrepancy. At some
point the null hypothesis
of no significant difference can be rejected.

There are Chi Square tables in the back of
most statistics books. You need to know the
number of categories studied and the level of
risk to use the table. If Chi Square is larger
in your study than in the table, the null hypothesis of
no difference can be rejected.

Equation for Chi Square:

$$\chi^2 = \sum_{i=1}^{n} \frac{(f_{o_i} - f_{e_i})^2}{f_{e_i}}$$

where $f_{o_i}$ is the observed frequency and $f_{e_i}$ the expected frequency in category $i$.

Example 1

In 200 people using the library (randomly
selected) 115 doctors and 85 nurses are found.
The difference may be due to sampling error.
Test the hypothesis that there really are an
equal number of doctors and nurses using the
library.

Observed frequencies: $f_{o_1} = 115$, $f_{o_2} = 85$

Expected frequencies: $f_{e_1} = 100$, $f_{e_2} = 100$

$$\chi^2 = \frac{(f_{o_1} - f_{e_1})^2}{f_{e_1}} + \frac{(f_{o_2} - f_{e_2})^2}{f_{e_2}} = \frac{(115-100)^2}{100} + \frac{(85-100)^2}{100} = 4.5$$

Number of categories = 2, d.f. = 2 - 1 = 1

$\chi^2_{.95}$ for 1 degree of freedom = 3.84; since
4.5 > 3.84, reject null hypothesis at 0.05
level of significance.

$\chi^2_{.99}$ for 1 degree of freedom = 6.63; since
4.5 < 6.63, cannot reject null hypothesis
at 0.01 level.
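
Example 1 can be checked with a short Python sketch. The function name `chi_square` is an assumption made for this illustration, and the two table values are simply the ones quoted in the example for 1 degree of freedom; the code does not look them up.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


observed = [115, 85]    # doctors, nurses found in the sample of 200
expected = [100, 100]   # equal numbers under the null hypothesis

chi2 = chi_square(observed, expected)
degrees_of_freedom = len(observed) - 1

# Table values for 1 degree of freedom, as quoted in the example.
table_values = {0.05: 3.84, 0.01: 6.63}

print(f"chi square = {chi2}, d.f. = {degrees_of_freedom}")
for level, cutoff in table_values.items():
    verdict = "reject" if chi2 > cutoff else "cannot reject"
    print(f"at the {level} level: {verdict} the null hypothesis")
```

The same function works for any number of categories, as long as the observed and expected lists are the same length.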

There probably is a significant difference
and there probably are not an equal number
of doctors and nurses using the library; however,
the difference could be due to sampling
error.

Example 2

Is there a difference in the frequency of
reading foreign language journals between
clinicians and researchers? A random sample
of 200 scientists was taken using the alphabetical
campus directory and systematic
sampling. The results were:

80 researchers, 40 of whom read foreign
language journals

120 clinicians, 50 of whom read foreign
language journals

f_o table (observed frequencies)

             Foreign   No Foreign   Total
Clinician       50         70        120
Researcher      40         40         80
Total           90        110        200

If there were no difference between the groups as 
to proportion who read foreign language journals: 



f_e table (expected frequencies)

             Foreign   No Foreign   Total
Clinician       54         66        120
Researcher      36         44         80
Total           90        110        200


Differences between the two tables:

f_o - f_e table

             Foreign   No Foreign
Clinician       -4         +4
Researcher      +4         -4

$$\chi^2 = \frac{(-4)^2}{54} + \frac{(+4)^2}{66} + \frac{(+4)^2}{36} + \frac{(-4)^2}{44} = 1.34$$

What does the answer mean?

Since $\chi^2_{.95} = 3.84$ for 1 d.f. and
3.84 > 1.34, it cannot be said that there is a
significant difference in the proportion of
clinicians and researchers who read foreign
language journals. The difference was likely
due to sampling error.
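
The expected-frequency table and the chi square value for the clinician and researcher example can be reproduced with the sketch below. The function names are assumptions made for this illustration. Each expected frequency is the row total times the column total divided by the grand total, which is how the figures 54, 66, 36 and 44 arise. Carried to more decimal places the statistic is about 1.347, which the text reports as 1.34.

```python
def expected_table(observed):
    """Expected frequencies if rows and columns were unrelated."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]


def chi_square_table(observed):
    """Chi square summed over every cell of the table."""
    expected = expected_table(observed)
    return sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(observed, expected)
        for fo, fe in zip(obs_row, exp_row)
    )


# Rows: clinician, researcher. Columns: foreign, no foreign.
observed = [[50, 70],
            [40, 40]]

expected = expected_table(observed)
chi2 = chi_square_table(observed)
table_value = 3.84  # 0.05 level, 1 degree of freedom, as quoted in the text

print("expected:", expected)
print(f"chi square = {chi2:.3f}")
print("significant" if chi2 > table_value else "not significant")
```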

Problems 5 and 7 on page 56 are related to 
this section. 

Can you think of some other library problems 
for which Chi Square would be an appropriate 
data analysis technique? 

Correlation

Correlation indicates how related two
variables are. One variable can be predicted
from another if they are highly related.
Correlation does not indicate a
causal relationship. A third undiscovered
variable may cause the effect or the relationship
may be purely accidental
(spurious correlation).

The statistic which is the measure of
the correlation (closeness of the relationship
between two variables) is the coefficient
of correlation (r). The coefficient of correlation
is useful in judging the relative
strengths of association and indicating significant
relationships between two variables.
A coefficient of 1 (or -1) is perfect, 0 is none.

Scatter Diagrams

[Figure: four scatter diagrams illustrating r = .86, r = -.9, r = .54, and r = 0]

Equation for coefficient of correlation:

$$r = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}}$$

where $x = X - \bar{X}$ and $y = Y - \bar{Y}$.

Example:

X    1    3    4    6    8    9   11   14
Y    1    2    4    4    5    7    8    9

  X     Y    x = X - X̄    y = Y - Ȳ    x^2     xy    y^2
  1     1       -6           -4         36     24     16
  3     2       -4           -3         16     12      9
  4     4       -3           -1          9      3      1
  6     4       -1           -1          1      1      1
  8     5        1            0          1      0      0
  9     7        2            2          4      4      4
 11     8        4            3         16     12      9
 14     9        7            4         49     28     16

ΣX = 56, X̄ = 7
ΣY = 40, Ȳ = 5
Σx^2 = 132, Σxy = 84, Σy^2 = 56

$$r = \frac{84}{\sqrt{(132)(56)}} = 0.977$$

Because data are taken from a sample, r
is affected by sampling error. Therefore,
after the correlation is computed, the null hypothesis
must be tested to see that r is significantly
larger than zero and did not occur just as a
result of sampling error. This is done with a
t test.

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

If you are doing the calculations by hand, you
must determine from a table of values of the t
statistic (usually found in the back of statistics
books) whether the t calculated is large enough
to reflect significance.

Computer programs are commonly available 
to conduct correlation analyses. The input 
required is the values for X and Y: the programs 
usually print the value of the correlation coefficient 
as well as its level of significance. 
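
As a rough illustration of what such a program does, the sketch below computes r from deviations about the means for the eight X and Y pairs in the example, then the t statistic in its standard form, t = r√(n−2)/√(1−r²). The function names `correlation` and `t_for_r` are assumptions made for this example. The code does not look up the significance level; that still comes from a t table with n − 2 degrees of freedom.

```python
from math import sqrt


def correlation(xs, ys):
    """Coefficient of correlation from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]   # x = X - mean of X
    dy = [y - mean_y for y in ys]   # y = Y - mean of Y
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / sqrt(sum_x2 * sum_y2)


def t_for_r(r, n):
    """t statistic for testing whether r differs from zero."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


# Data from the worked example in the text.
X = [1, 3, 4, 6, 8, 9, 11, 14]
Y = [1, 2, 4, 4, 5, 7, 8, 9]

r = correlation(X, Y)
t = t_for_r(r, len(X))
print(f"r = {r:.3f}, t = {t:.2f} with {len(X) - 2} degrees of freedom")
```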

A significant result should be treated with
caution since correlations can be due to coincidence
or due to a third variable. Remember, no
cause and effect can be inferred from correlation.

Examples of items which might be related
and thus which might be subject to correlation
analysis:

1. Books checked out and grade point

2. Operating expenditures and circulation

3. Book budget and circulation

4. Use of materials in house and
circulation

What are some other examples?

Regression

If a relationship is found to exist between
variables, you might want to express this
relationship in mathematical form with an
equation connecting the variables.

[Figure: two scatter diagrams, one labeled Linear and one labeled Nonlinear]

The simplest type of approximating curve is a
straight line, whose equation can be written:

$$y = a + bx$$

a = y intercept
b = slope, amount of
change in y for one
unit change in x

Using the least squares method, solving
for a and b:

$$a = \frac{(\sum Y)(\sum X^2) - (\sum X)(\sum XY)}{n\sum X^2 - (\sum X)^2}$$

$$b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}$$

The method of least squares finds the
coefficients a and b which make the sum
of the squares of the distances from the line
the least possible. This technique can also
be used with more than two variables and is
then called multiple regression. Again, computer
programs are available to calculate solutions.
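
A least squares fit for one X variable can be sketched directly from the sums ΣX, ΣY, ΣX² and ΣXY. The function name `least_squares` is an assumption made for this illustration, and the data are the eight X and Y pairs from the correlation example.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y = a + b*x fitted by least squares."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    denominator = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator   # y intercept
    b = (n * sum_xy - sum_x * sum_y) / denominator        # slope
    return a, b


# Data from the correlation example in the text.
X = [1, 3, 4, 6, 8, 9, 11, 14]
Y = [1, 2, 4, 4, 5, 7, 8, 9]

a, b = least_squares(X, Y)
print(f"y = {a:.3f} + {b:.3f}x")
```

A useful check on any fitted line is that it passes through the point formed by the two means, here X = 7 and Y = 5.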

After the regression line has been found,
you need to know how well the line fits the
data. If all the points of the scatter lie on the
line, there is perfect linear correlation. The quantity
r is called the coefficient of correlation.

$$r = \pm\sqrt{\frac{\text{explained variation}}{\text{total variation}}}$$

r varies from -1 to +1 and is a measure of the
linear correlation between two variables.

The standard error of the estimate indicates 
the accuracy of the estimate. It is a measure 
of the scatter about the regression line. 
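
The link between r, explained variation, and the scatter about the line can be illustrated with the sketch below, using the eight X and Y pairs from the correlation example. Two points are assumptions made for this illustration: the variable names, and the divisor n − 2 in the standard error of the estimate (some texts divide by n instead; the text here does not give the formula).

```python
from math import sqrt

# Data from the correlation example in the text.
X = [1, 3, 4, 6, 8, 9, 11, 14]
Y = [1, 2, 4, 4, 5, 7, 8, 9]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Least squares slope and intercept, written in deviation form.
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
     / sum((x - mean_x) ** 2 for x in X))
a = mean_y - b * mean_x
predicted = [a + b * x for x in X]

# Total variation splits into explained and unexplained parts.
total_variation = sum((y - mean_y) ** 2 for y in Y)
explained_variation = sum((p - mean_y) ** 2 for p in predicted)
unexplained_variation = sum((y - p) ** 2 for y, p in zip(Y, predicted))

# r takes the sign of the slope.
r = sqrt(explained_variation / total_variation)
if b < 0:
    r = -r

# Assumption: n - 2 divisor; some texts use n.
standard_error_of_estimate = sqrt(unexplained_variation / (n - 2))

print(f"r = {r:.3f}")
print(f"standard error of the estimate = {standard_error_of_estimate:.3f}")
```

The value of r found this way agrees with the one obtained from the correlation equation, which is the point of the explained-variation form.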



Baumol has used regression to predict total
operating expenses of a library from number of
personnel and other variables. Can you think of
other examples?

Regression is a powerful tool for prediction 
but it does not mean that a value outside the 
previously examined sphere will fit the line. 

Presentation of Results

The final stage of a study is the fair
interpretation and clear presentation of the results.
The interpretation should include some
consideration of the level of risk and the precision
of the results. It is also essential that
the reader be made aware of any distortions
in the data due to sampling bias, as well as the limitations
of the analysis procedures used. For
example, it should be made clear that correlation
does not indicate cause and effect.

The well-analyzed and presented study
acknowledges its weaknesses, indicates what
results are doubted, and suggests what remains
to be done. In writing the report, emphasis
should not be placed only on results which confirm
the investigator's opinions; all significant findings,
supporting or not, should be reported.

The presentation of results should highlight
the important findings, not obfuscate them
with data. Vast amounts of statistical data in
raw form only confuse the issue. The data
should be presented in an organized way to help
communicate the facts to the reader. The
organization can take two basic forms: statistical
tables and figures. A statistical
table is a presentation of numbers in a logical
arrangement, with some brief explanation to
show what they are. A figure is a chart,
graph, map, or some other illustration designed
to present statistical data in picture form.



Figures - The making and understanding
of figures requires neither artistic nor mathematical
skill; only common sense and the ability to use a ruler
and compass are needed.

Line graphs are particularly good for
indicating trends over periods.

[Figure: line graph of circulations by month, December through November]

However, truncation or changing the proportion 
between the ordinate and the abscissa 
can distort the true picture if this is 
not made clear to the reader. 



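[Figure: two line graphs plotted by month. In one the ordinate (vertical axis) is truncated to run from 19,500,000 to 20,000,000; in the other it is marked 10, 20, 30. The abscissa (horizontal axis) shows months.]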



Column or bar charts are useful when
a number of different groups are compared.

[Figure: column chart titled "Exit Count"; vertical axis: average daily count]

The reader should be presented with
comparable figures. Often a percentage
or ratio figure may be more informative
in a column chart than only the
raw figures.

Problems

1. Calculate the mean, median and mode for the
two sets of data below:

Group 1 Group 2 

2 5 9 2 3 

2 6 6 9 1 

3 7 10 



2. Calculate the standard deviation and variance
for the numbers in problem 1. Which set of
data is the more scattered?

3. You wish to find the average time to answer
a reference question. A preliminary sample indicates
that the variance in questions is 9
minutes. You wish to have no more than one
minute on either side of the mean with a
level of risk of only five per cent (95% confidence
level), z = 1.96. Using the equation
for sample size, calculate what size random
sample should be taken.



4. The standard deviation of a group of data on
the width of books is 2 inches. The mean
of the sample was 5 inches. At a 95% confidence
level, what is the confidence interval?
There were 25 books in the sample.

5. A personnel director is interested in trying to
determine if the season of the year has any
effect on the number of employees who resign.
His records give the following information:

Season    Number of resignations
Winter            10
Spring            22
Summer            19
Fall               9
Total             60

Test at a significance level of 0.05 to determine
if there is a significant deviation between
the observed distribution and a uniform distribution
(equal for all seasons). H_0 = null
hypothesis, no difference, that is, the proportion
of resignations is independent of the season of
the year. A significance level of 0.05 requires
a $\chi^2$ value of 7.815 (column in the chi square
table headed 0.05 with three degrees of freedom);
reject H_0 if $\chi^2 > 7.815$.



6. Make a significance test to determine if it can
be assumed that the variance in width of books
is the same as the variance in
width of journals.

Sample 1: Books      Sample 2: Journals
n_1 = 16             n_2 = 21
s_1 = 1.5"           s_2 =

F ratio for d.f._1 = 15 and d.f._2 = 20
at 0.05 = 2.33.

7. The library speculated that if they offered a
free MEDLINE search they would increase the
percentage of questionnaires returned. To test this
theory they sent questionnaires to a random sample
of 30 persons on the list with the offer of a
MEDLINE search. They sent questionnaires to another
30 with no MEDLINE offer. The results are
shown below:

               Returned   Not returned   Total
Offered           22            8          30
Not offered       14           16          30
Total             36           24          60

The problem is to test at the 0.05 level to see if
there is a significant difference in the proportion
of returns when the MEDLINE search is offered.
A 0.05 level requires a Chi Square value of
3.841 (column 0.05 with 1 degree of freedom).

8. You took some observations on the number of
persons exiting the library and books used in the
library each hour for eight hours. X = people
exiting the library, Y = books used in the library.

X    1    3    4    6    8    9   11   14
Y    1    2    4    4    5    7    8    9

Using the equation for the correlation coefficient,
see if these variables are highly correlated. Is
r significantly greater than 0?

9. Find the least squares line for the data in
problem 8. a = ____, b = ____. What is the equation
for the line?

  X     Y     X^2     XY    Y^2
  1     1       1      1      1
  3     2       9      6      4
  4     4      16     16     16
  6     4      36     24     16
  8     5      64     40     25
  9     7      81     63     49
 11     8     121     88     64
 14     9     196    126     81

ΣX = 56   ΣY = 40   ΣX^2 = 524   ΣXY = 364   ΣY^2 = 256





10. Using the data in problem 6 and the fact that the
sample mean of books = 6 and the sample mean of
journals = 8, use a t-test to determine whether there
is a significant difference between the means.

11. The amounts of money spent on foreign monographs
and domestic monographs, foreign serials
and domestic serials are given below. Using a compass,
construct a pie chart of the following data.

Foreign monographs      25,000     10%
Domestic monographs     50,000     20%
Foreign serials         75,000     30%
Domestic serials       100,000     40%
TOTAL                  250,000    100%

12. Can you match the following problems to an appropriate data
analysis technique?

Problems

a. The average number of
books checked out by each
student last year was 3. Is
it the same this year?

b. What's the average width
of a journal volume?

c. Is the price of materials
in chemistry and physics
about the same?

d. Is there a relationship
between the number of
in-house uses and circulations?

e. I want to estimate in-house
uses from circulations.

f. Do nurses use the library
more frequently than physicians?

g. Is there a difference in
the length of tenure of
employees in Technical
Services, Public Services
and Administration Divisions?

Techniques

1. Chi Square

2. Analysis of variance

3. t-test for the mean of
a population

4. Regression

5. Correlation

6. t-test for the means
of two samples

7. Estimation of a
parameter

Variance

Standard Deviation

Standard Error of the Mean

Correlation Coefficient

$$r = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}}$$

Regression Coefficients

$$a = \frac{(\sum Y)(\sum X^2) - (\sum X)(\sum XY)}{n\sum X^2 - (\sum X)^2}$$

$$b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}$$